Usenet 1993 July

home *** CD-ROM | disk | FTP | other *** search

/ Usenet 1993 July / InfoMagic USENET CD-ROM July 1993.ISO / sources / unix / volume7 / regex < prev next >

Wrap

Text File | 1988-10-17 | 35.4 KB | 1,522 lines

Reply-To: oz@yetti.UUCP (Ozan Yigit) Subject: v07i021: Ed(1)/regex(3)-compatible reg. exp. package Newsgroups: mod.sources Approved: mirror!rs Submitted by: ihnp4!yetti!oz (Ozan Yigit) Mod.sources: Volume 7, Issue 21 Archive-name: regex #!/bin/sh # This is a shell archive, meaning: # 1. Remove everything above the #!/bin/sh line. # 2. Save the resulting text in a file. # 3. Execute the file with /bin/sh (not csh) to create the files: # README.txt # regex.3 # regex.c # re_fail.c # grep.c # makefile # This archive created: Thu Aug 14 17:41:27 1986 export PATH; PATH=/bin:$PATH echo shar: extracting "'README.txt'" '(1925 characters)' if test -f 'README.txt' then echo shar: over-writing existing file "'README.txt'" fi sed 's/^X//' << \SHAR_EOF > 'README.txt' XThis is a pre-release of a bunch of berkelix regex(3)/ed(1) Xcompatible regular-expression routines. X XThese routines are completely public domain. You can do whatever Xyou like with them, and hopefully you are professional enough not Xto strip out the authorship information, acknowledgements and Xreferences. X XThe reason for this being a *pre-release* is that I received a lot Xof useful suggestions about packaging, about additional routines etc. Xfrom a few people. I do not have too much time to do those changes Xright now, so I am putting this package out for those who needed Xit yesterday. Next release will include other routines, and better Xpackaging. X XThese routines are *not* tested under SysV, but they are tested Xunder PRO/Venix (2.0) and BSD 4.2. X XThose uEmacs V30 hackers should take notice of this package, Xas well as those who would love to replace the "restricted" regex Xroutines in Jove. [Since parts of these routines are based on Conroy's Xoriginal work, it would be only fitting to put these into uEmacs.] X XIn general, these routines run just as fast, or faster than regex library Xroutines under BSD 4.2. In some cases, they are slightly slower. I did not Xtry too hard to optimize the re_exec routine. X XCoding style is a la K&R, with lotsa short identifiers. I like it Xthat way. All flames should be fed to yetti!dragon. X XAcknowledgements: Henry Spencer, Hugh Redelmeier and Drew Sullivan made X a lot of important suggestions, some of which will be X incorporated into the next version. X X XPackage: X ---------------------------------------------------- X README.txt this file. X X regex.3 comprehensive man page X regex.c goodies X X re_fail.c sample failure routine X X grep.c a tiny grep for timing tests X X makefile guess. X ---------------------------------------------------- X Xenjoy.. oz [...!utzoo!yetti!oz] X [oz@yusol.BITNET || oz@yuyetti.BITNET] X X Dept. of Computer Science X York University SHAR_EOF if test 1925 -ne "`wc -c 'README.txt'`" then echo shar: error transmitting "'README.txt'" '(should have been 1925 characters)' fi echo shar: extracting "'regex.3'" '(7847 characters)' if test -f 'regex.3' then echo shar: over-writing existing file "'regex.3'" fi sed 's/^X//' << \SHAR_EOF > 'regex.3' X.TH regex 3 local X.DA Jun 19 1986 X.SH NAME Xre_comp, re_exec, re_subs, re_modw, re_fail \- regular expression handling X.SH ORIGIN XDept. of Computer Science X.br XYork University X.SH SYNOPSIS X.B char *re_comp(pat) X.br X.B char *pat; X.PP X.B re_exec(str) X.br X.B char *str; X.PP X.B re_subs(src, dst) X.br X.B char *src; X.br X.B char *dst; X.PP X.B void re_fail(msg, op) X.br X.B char *msg; X.br X.B char op; X.PP X.B void re_modw(str) X.br X.B char *str; X X.SH DESCRIPTION X.PP XThese functions implement X.IR ed (1)-style Xpartial regular expressions and supporting facilities. X.PP X.I Re_comp Xcompiles a pattern string into an internal form (a deterministic finite-state Xautomaton) to be executed by X.I re_exec Xfor pattern matching. X.I Re_comp Xreturns 0 if the pattern is compiled successfully, otherwise it returns an Xerror message string. If X.I re_comp Xis called with a 0 or a \fInull\fR string, it returns without changing the Xcurrently compiled regular expression. X.sp X.I Re_comp Xsupports the same limited set of X.I regular expressions Xfound in X.I ed Xand Berkeley X.IR regex (3) Xroutines: X.sp X.if n .in +1.6i X.if t .in +1i X.de Ti X.if n .ti -1.6i X.if t .ti -1i X.. X.if n .ta 0.8i +0.8i +0.8i X.if t .ta 0.5i +0.5i +0.5i X.Ti X[1] \fIchar\fR Matches itself, unless it is a special Xcharacter (meta-character): \fB. \\ [ ] * + ^ $\fR X X.Ti X[2] \fB.\fR Matches \fIany\fR character. X X.Ti X[3] \fB\\\fR Matches the character following it, except Xwhen followed by a digit 1 to 9, \fB(\fR, fB)\fR, \fB<\fR or \fB>\fR. X(see [7], [8] and [9]) It is used as an escape character for all Xother meta-characters, and itself. When used Xin a set ([4]), it is treated as an ordinary Xcharacter. X X.Ti X[4] \fB[\fIset\fB]\fR Matches one of the characters in the set. XIf the first character in the set is \fB^\fR, Xit matches a character NOT in the set. A Xshorthand X.IR S - E Xis used to specify a set of Xcharacters X.I S Xup to X.IR E , Xinclusive. The special Xcharacters \fB]\fR and \fB-\fR have no special Xmeaning if they appear as the first chars Xin the set. X.nf X examples: match: X [a-z] any lowercase alpha X [^]-] any char except ] and - X [^A-Z] any char except X uppercase alpha X [a-zA-Z0-9] any alphanumeric X.fi X X.Ti X[5] \fB*\fR Any regular expression form [1] to [4], followed by Xclosure char (*) matches zero or more matches of Xthat form. X X.Ti X[6] \fB+\fR Same as [5], except it matches one or more. X X.Ti X[7] A regular expression in the form [1] to [10], enclosed Xas \$\fIform\fR\$ matches what form matches. The enclosure Xcreates a set of tags, used for [8] and for Xpattern substitution in X.I re_subs. XThe tagged forms are numbered Xstarting from 1. X X.Ti X[8] A \\ followed by a digit 1 to 9 matches whatever a Xpreviously tagged regular expression ([7]) matched. X X.Ti X[9] \fB\\<\fR Matches the beginning of a \fIword\fR, Xthat is, an empty string followed by a Xletter, digit, or _ and not preceded by Xa letter, digit, or _ . X.Ti X \fB\\>\fR Matches the end of a \fIword\fR, Xthat is, an empty string preceded Xby a letter, digit, or _ , and not Xfollowed by a letter, digit, or _ . X X.Ti X[10] A composite regular expression X\fIxy\fR where \fIx\fR and \fIy\fR Xare in the form of [1] to [10] matches the longest Xmatch of \fIx\fR followed by a match for \fIy\fR. X X.Ti X[11] \fB^ $\fR a regular expression starting with a \fB^\fR character Xand/or ending with a \fB$\fR character, restricts the Xpattern matching to the beginning of the line, Xand/or the end of line [anchors]. Elsewhere in the Xpattern, \fB^\fR and \fB$\fR are treated as ordinary characters. X.if n .in -1.6i X.if t .in -1i X X.PP X.I Re_exec Xexecutes the internal form produced by X.I re_comp Xand searches the argument string for the regular expression described Xby the internal Xform. X.I Re_exec Xreturns 1 if the last regular expression pattern is matched within the string, X0 if no match is found. In case of an internal error (corrupted internal Xform), X.I re_exec Xcalls the user-supplied X.I re_fail Xand returns 0. X.PP XThe strings passed to both X.I re_comp Xand X.I re_exec Xmay have trailing or embedded newline characters. The strings Xmust be terminated by nulls. X.PP X.I Re_subs Xdoes X.IR ed -style Xpattern substitution, after a successful match is found by X.I re_exec. XThe source string parameter to X.I re_subs Xis copied to the destination string with the following interpretation; X.sp X.if n .in +1.6i X.if t .in +1i X.Ti X[1] & Substitute the entire matched string in the destination. X X.Ti X[2] \\\fIn\fR Substitute the substring matched by a tagged subpattern Xnumbered \fIn\fR, where \fIn\fR is between 1 to 9, inclusive. X X.Ti X[3] \\\fIchar\fR Treat the next character literally, Xunless the character is a digit ([2]). X.if n .in -1.6i X.if t .in -1i X X.PP XIf the copy operation with the substitutions is successful, X.I re_subs Xreturns 1. XIf the source string is corrupted, or the last call to X.I re_exec Xfails, it returns 0. X X.I Re_modw Xis used to Xadd new characters into an internal table to Xchange the re_exec's understanding of what Xa \fIword\fR should look like, when matching with \fB\\<\fR and \fB\\>\fR Xconstructs. If the string parameter is 0 or null string, Xthe table is reset back to the default, which contains \fBA-Z a-z 0-9 _\fR . X X.I Re_fail Xis a user-supplied routine to handle internal errors. X.I re_exec Xcalls X.I re_fail Xwith an error message string, and the opcode character that caused the error. XThe default X.I re_fail Xroutine simply prints the message and the opcode character to X.I stderr Xand invokes X.IR exit (2). X.SH EXAMPLES XIn the examples below, the X.I dfaform Xdescribes the internal form after the pattern is compiled. For additional Xdetails, refer to the sources. X.PP X.ta 0.5i +0.5i +0.5i X.nf Xfoo*.* X dfaform: CHR f CHR o CLO CHR o END CLO ANY END END X matches: \fIfo foo fooo foobar fobar foxx ...\fR X Xfo[ob]a[rz] X dfaform: CHR f CHR o CCL 2 o b CHR a CCL 2 r z END X matches: \fIfobar fooar fobaz fooaz\fR X Xfoo\\\\+ X dfaform: CHR f CHR o CHR o CHR \\ CLO CHR \\ END END X matches: \fIfoo\\ foo\\\\ foo\\\\\\ ...\fR X X\$foo\$[1-3]\\1 (same as foo[1-3]foo, but takes less internal space) X dfaform: BOT 1 CHR f CHR o CHR o EOT 1 CCL 3 1 2 3 REF 1 END X matches: \fIfoo1foo foo2foo foo3foo\fR X X\$fo.*\$-\\1 X dfaform: BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END X matches: \fIfoo-foo fo-fo fob-fob foobar-foobar ...\fR X.SH DIAGNOSTICS X.I Re_comp Xreturns one of the following strings if an error occurs: X.PP X.nf X.in +0.5i X\fINo previous regular expression, XEmpty closure, XIllegal closure, XCyclical reference, XUndetermined reference, XUnmatched \e(, XMissing ], XNull pattern inside \e(\e), XNull pattern inside \e<\e>, XToo many \e(\e) pairs, XUnmatched \e)\fP. X.in -0.5i X.fi X.SH REFERENCES X.if n .ta 3i X.if t .ta 1.8i X.nf X\fISoftware tools\fR Kernighan & Plauger X\fISoftware tools in Pascal\fR Kernighan & Plauger X\fIGrep sources\fR [rsx-11 C dist] David Conroy X\fIEd - text editor\fR Unix Programmer's Manual X\fIAdvanced editing on Unix\fR B. W. Kernighan X\fIRegExp sources\fR Henry Spencer X.fi X.SH "HISTORY AND NOTES" XThese routines are derived from various implementations Xfound in X.I "Software Tools" Xbooks, and David Conroy's X.I grep. XThey are NOT derived from licensed/restricted software. XFor more interesting/academic/complicated implementations, Xsee Henry Spencer's X.I regexp Xroutines (V8), or X.I "GNU Emacs" Xpattern Xmatching module. X.PP XThe X.I re_comp Xand X.I re_exec Xroutines perform X.I almost Xas well as their licensed counterparts, sometimes better. XIn very few instances, they Xare about 10% to 15% slower. X.SH AUTHOR XOzan S. Yigit (oz) X.br Xusenet: utzoo!yetti!oz X.br Xbitnet: oz@yusol || oz@yuyetti X.SH "SEE ALSO" Xed(1), ex(1), egrep(1), fgrep(1), grep(1), regex(3) X.SH BUGS XThese routines are \fIPublic Domain\fR. You can get them Xin source. X.br XThe internal storage for the \fIdfa form\fR is not checked for Xoverflows. Currently, it is 1024 bytes. X.br XOthers, no doubt. SHAR_EOF if test 7847 -ne "`wc -c 'regex.3'`" then echo shar: error transmitting "'regex.3'" '(should have been 7847 characters)' fi echo shar: extracting "'regex.c'" '(18779 characters)' if test -f 'regex.c' then echo shar: over-writing existing file "'regex.c'" fi sed 's/^X//' << \SHAR_EOF > 'regex.c' X/* X * regex - Regular expression pattern matching X * and replacement X * X * X * By: Ozan S. Yigit (oz) X * Dept. of Computer Science X * York University X * X * X * These routines are the PUBLIC DOMAIN equivalents X * of regex routines as found in 4.nBSD UN*X, with minor X * extensions. X * X * These routines are derived from various implementations X * found in software tools books, and Conroy's grep. They X * are NOT derived from licensed/restricted software. X * For more interesting/academic/complicated implementations, X * see Henry Spencer's regexp routines, or GNU Emacs pattern X * matching module. X * X * Routines: X * re_comp: compile a regular expression into X * a DFA. X * X * char *re_comp(s) X * char *s; X * X * re_exec: execute the DFA to match a pattern. X * X * int re_exec(s) X * char *s; X * X * re_modw change re_exec's understanding of what X * a "word" looks like (for \< and \>) X * by adding into the hidden word-character X * table. X * X * void re_modw(s) X * char *s; X * X * re_subs: substitute the matched portions in X * a new string. X * X * int re_subs(src, dst) X * char *src; X * char *dst; X * X * re_fail: failure routine for re_exec. X * X * void re_fail(msg, op) X * char *msg; X * char op; X * X * Regular Expressions: X * X * [1] char matches itself, unless it is a special X * character (metachar): . \ [ ] * + ^ $ X * X * [2] . matches any character. X * X * [3] \ matches the character following it, except X * when followed by a left or right round bracket, X * a digit 1 to 9 or a left or right angle bracket. X * (see [7], [8] and [9]) X * It is used as an escape character for all X * other meta-characters, and itself. When used X * in a set ([4]), it is treated as an ordinary X * character. X * X * [4] [set] matches one of the characters in the set. X * If the first character in the set is "^", X * it matches a character NOT in the set. A X * shorthand S-E is used to specify a set of X * characters S upto E, inclusive. The special X * characters "]" and "-" have no special X * meaning if they appear as the first chars X * in the set. X * examples: match: X * X * [a-z] any lowercase alpha X * X * [^]-] any char except ] and - X * X * [^A-Z] any char except uppercase X * alpha X * X * [a-zA-Z] any alpha X * X * [5] * any regular expression form [1] to [4], followed by X * closure char (*) matches zero or more matches of X * that form. X * X * [6] + same as [5], except it matches one or more. X * X * [7] a regular expression in the form [1] to [10], enclosed X * as $form$ matches what form matches. The enclosure X * creates a set of tags, used for [8] and for X * pattern substution. The tagged forms are numbered X * starting from 1. X * X * [8] a \ followed by a digit 1 to 9 matches whatever a X * previously tagged regular expression ([7]) matched. X * X * [9] \< a regular expression starting with a \< construct X * \> and/or ending with a \> construct, restricts the X * pattern matching to the beginning of a word, and/or X * the end of a word. A word is defined to be a character X * string beginning and/or ending with the characters X * A-Z a-z 0-9 and _. It must also be preceded and/or X * followed by any character outside those mentioned. X * X * [10] a composite regular expression xy where x and y X * are in the form [1] to [10] matches the longest X * match of x followed by a match for y. X * X * [11] ^ a regular expression starting with a ^ character X * $ and/or ending with a $ character, restricts the X * pattern matching to the beginning of the line, X * or the end of line. [anchors] Elsewhere in the X * pattern, ^ and $ are treated as ordinary characters. X * X * X * Acknowledgements: X * X * HCR's Hugh Redelmeier has been most helpful in various X * stages of development. He convinced me to include BOW X * and EOW constructs, originally invented by Rob Pike at X * the University of Toronto. X * X * References: X * Software tools Kernighan & Plauger X * Software tools in Pascal Kernighan & Plauger X * Grep [rsx-11 C dist] David Conroy X * ed - text editor Un*x Programmer's Manual X * Advanced editing on Un*x B. W. Kernighan X * RegExp routines Henry Spencer X * X * Notes: X * X * This implementation uses a bit-set representation for character X * classes for speed and compactness. Each character is represented X * by one bit in a 128-bit block. Thus, CCL or NCL always takes a X * constant 16 bytes in the internal dfa, and re_exec does a single X * bit comparison to locate the character in the set. X * X * Examples: X * X * pattern: foo*.* X * compile: CHR f CHR o CLO CHR o END CLO ANY END END X * matches: fo foo fooo foobar fobar foxx ... X * X * pattern: fo[ob]a[rz] X * compile: CHR f CHR o CCL 2 o b CHR a CCL bitset END X * matches: fobar fooar fobaz fooaz X * X * pattern: foo\\+ X * compile: CHR f CHR o CHR o CHR \ CLO CHR \ END END X * matches: foo\ foo\\ foo\\\ ... X * X * pattern: $foo$[1-3]\1 (same as foo[1-3]foo) X * compile: BOT 1 CHR f CHR o CHR o EOT 1 CCL bitset REF 1 END X * matches: foo1foo foo2foo foo3foo X * X * pattern: $fo.*$-\1 X * compile: BOT 1 CHR f CHR o CLO ANY END EOT 1 CHR - REF 1 END X * matches: foo-foo fo-fo fob-fob foobar-foobar ... X * X */ X X#define MAXDFA 1024 X#define MAXTAG 10 X X#define OKP 1 X#define NOP 0 X X#define CHR 1 X#define ANY 2 X#define CCL 3 X#define NCL 4 X#define BOL 5 X#define EOL 6 X#define BOT 7 X#define EOT 8 X#define BOW 9 X#define EOW 10 X#define REF 11 X#define CLO 12 X X#define END 0 X X/* X * The following defines are not meant X * to be changeable. They are for readibility X * only. X * X */ X#define MAXCHR 128 X#define CHRBIT 8 X#define BITBLK MAXCHR/CHRBIT X#define BLKIND 0170 X#define BITIND 07 X X#define ASCIIB 0177 X Xtypedef /*unsigned*/ char CHAR; X Xstatic int tagstk[MAXTAG]; /* subpat tag stack..*/ Xstatic CHAR dfa[MAXDFA]; /* automaton.. */ Xstatic int sta = NOP; /* status of lastpat */ X Xstatic CHAR bittab[BITBLK]; /* bit table for CCL */ X Xstatic void Xchset(c) register CHAR c; { bittab[((c)&BLKIND)>>3] |= 1<<((c)&BITIND); } X X#define badpat(x) return(*dfa = END, x) X#define store(x) *mp++ = x X Xchar * Xre_comp(pat) Xchar *pat; X{ X register char *p; /* pattern pointer */ X register CHAR *mp=dfa; /* dfa pointer */ X register CHAR *lp; /* saved pointer.. */ X register CHAR *sp=dfa; /* another one.. */ X X register int tagi = 0; /* tag stack index */ X register int tagc = 1; /* actual tag count */ X X register int n; X int c1, c2; X X if (!pat || !*pat) X if (sta) X return(0); X else X badpat("No previous regular expression"); X sta = NOP; X X for (p = pat; *p; p++) { X lp = mp; X switch(*p) { X X case '.': /* match any char.. */ X store(ANY); X break; X X case '^': /* match beginning.. */ X if (p == pat) X store(BOL); X else { X store(CHR); X store(*p); X } X break; X X case '$': /* match endofline.. */ X if (!*(p+1)) X store(EOL); X else { X store(CHR); X store(*p); X } X break; X X case '[': /* match char class..*/ X X if (*++p == '^') { X store(NCL); X p++; X } X else X store(CCL); X X if (*p == '-') /* real dash */ X chset(*p++); X if (*p == ']') /* real brac */ X chset(*p++); X while (*p && *p != ']') { X if (*p == '-' && *(p+1) && *(p+1) != ']') { X p++; X c1 = *(p-2) + 1; X c2 = *p++; X while (c1 <= c2) X chset(c1++); X } X#ifdef EXTEND X else if (*p == '\\' && *(p+1)) { X p++; X chset(*p++); X } X#endif X else X chset(*p++); X } X if (!*p) X badpat("Missing ]"); X X for (n = 0; n < BITBLK; bittab[n++] = (char) 0) X store(bittab[n]); X X break; X X case '*': /* match 0 or more.. */ X case '+': /* match 1 or more.. */ X if (p == pat) X badpat("Empty closure"); X lp = sp; /* previous opcode */ X if (*lp == CLO) /* equivalence.. */ X break; X switch(*lp) { X X case BOL: X case BOT: X case EOT: X case BOW: X case EOW: X case REF: X badpat("Illegal closure"); X default: X break; X } X X if (*p == '+') X for (sp = mp; lp < sp; lp++) X store(*lp); X X store(END); X store(END); X sp = mp; X while (--mp > lp) X *mp = mp[-1]; X store(CLO); X mp = sp; X break; X X case '\\': /* tags, backrefs .. */ X switch(*++p) { X X case '(': X if (tagc < MAXTAG) { X tagstk[++tagi] = tagc; X store(BOT); X store(tagc++); X } X else X badpat("Too many \$\$ pairs"); X break; X case ')': X if (*sp == BOT) X badpat("Null pattern inside \$\$"); X if (tagi > 0) { X store(EOT); X store(tagstk[tagi--]); X } X else X badpat("Unmatched \\)"); X break; X case '<': X store(BOW); X break; X case '>': X if (*sp == BOW) X badpat("Null pattern inside \\<\\>"); X store(EOW); X break; X case '1': X case '2': X case '3': X case '4': X case '5': X case '6': X case '7': X case '8': X case '9': X n = *p-'0'; X if (tagi > 0 && tagstk[tagi] == n) X badpat("Cyclical reference"); X if (tagc > n) { X store(REF); X store(n); X } X else X badpat("Undetermined reference"); X break; X#ifdef EXTEND X case 'b': X store(CHR); X store('\b'); X break; X case 'n': X store(CHR); X store('\n'); X break; X case 'f': X store(CHR); X store('\f'); X break; X case 'r': X store(CHR); X store('\r'); X break; X case 't': X store(CHR); X store('\t'); X break; X#endif X default: X store(CHR); X store(*p); X } X break; X X default : /* an ordinary char */ X store(CHR); X store(*p); X break; X } X sp = lp; X } X if (tagi > 0) X badpat("Unmatched \\("); X store(END); X sta = OKP; X return(0); X} X X Xstatic char *bol; Xstatic char *bopat[MAXTAG]; Xstatic char *eopat[MAXTAG]; Xchar *pmatch(); X X/* X * re_exec: X * execute dfa to find a match. X * X * special cases: (dfa[0]) X * BOL X * Match only once, starting from the X * beginning. X * CHR X * First locate the character without X * calling pmatch, and if found, call X * pmatch for the remaining string. X * END X * re_comp failed, poor luser did not X * check for it. Fail fast. X * X * If a match is found, bopat[0] and eopat[0] are set X * to the beginning and the end of the matched fragment, X * respectively. X * X */ X Xint Xre_exec(lp) Xregister char *lp; X{ X register char c; X register char *ep = 0; X register CHAR *ap = dfa; X X bol = lp; X X bopat[0] = 0; X bopat[1] = 0; X bopat[2] = 0; X bopat[3] = 0; X bopat[4] = 0; X bopat[5] = 0; X bopat[6] = 0; X bopat[7] = 0; X bopat[8] = 0; X bopat[9] = 0; X X switch(*ap) { X X case BOL: /* anchored: match from BOL only */ X ep = pmatch(lp,ap); X break; X case CHR: /* ordinary char: locate it fast */ X c = *(ap+1); X while (*lp && *lp != c) X lp++; X if (!*lp) /* if EOS, fail, else fall thru. */ X return(0); X default: /* regular matching all the way. */ X while (*lp) { X if ((ep = pmatch(lp,ap))) X break; X lp++; X } X break; X case END: /* munged automaton. fail always */ X return(0); X } X if (!ep) X return(0); X X bopat[0] = lp; X eopat[0] = ep; X return(1); X} X X/* X * pmatch: X * internal routine for the hard part X * X * This code is mostly snarfed from an early X * grep written by David Conroy. The backref and X * tag stuff, and various other mods are by oZ. X * X * special cases: (dfa[n], dfa[n+1]) X * CLO ANY X * We KNOW ".*" will match ANYTHING X * upto the end of line. Thus, go to X * the end of line straight, without X * calling pmatch recursively. As in X * the other closure cases, the remaining X * pattern must be matched by moving X * backwards on the string recursively, X * to find a match for xy (x is ".*" and X * y is the remaining pattern) where X * the match satisfies the LONGEST match X * for x followed by a match for y. X * CLO CHR X * We can again scan the string forward X * for the single char without recursion, X * and at the point of failure, we execute X * the remaining dfa recursively, as X * described above. X * X * At the end of a successful match, bopat[n] and eopat[n] X * are set to the beginning and end of subpatterns matched X * by tagged expressions (n = 1 to 9). X * X */ X Xextern void re_fail(); X X/* X * character classification table for word boundary X * operators BOW and EOW. the reason for not using X * ctype macros is that we can let the user add into X * our own table. see re_modw. This table is not in X * the bitset form, since we may wish to extend it X * in the future for other character classifications. X * X * TRUE for 0-9 A-Z a-z _ X */ Xstatic char chrtyp[MAXCHR] = { X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, X 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, X 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, X 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, X 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, X 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, X 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, X 1, 1, 1, 0, 0, 0, 0, 0 X }; X X#define inascii(x) (0177&(x)) X#define iswordc(x) chrtyp[inascii(x)] X#define isinset(x,y) ((x)[((y)&BLKIND)>>3] & (1<<((y)&BITIND))) X X/* X * skip values for CLO XXX to skip past the closure X * X */ X X#define ANYSKIP 2 /* CLO ANY END ... */ X#define CHRSKIP 3 /* CLO CHR chr END ... */ X#define CCLSKIP 18 /* CLO CCL 16bytes END ... */ X Xstatic char * Xpmatch(lp, ap) Xregister char *lp; Xregister CHAR *ap; X{ X register char *e; /* extra pointer for CLO */ X register char *bp; /* beginning of subpat.. */ X register char *ep; /* ending of subpat.. */ X register int op, c, n; X char *are; /* to save the line ptr. */ X X while ((op = *ap++) != END) X switch(op) { X X case CHR: X if (*lp++ != *ap++) X return(0); X break; X case ANY: X if (!*lp++) X return(0); X break; X case CCL: X c = *lp++; X if (!isinset(ap,c)) X return(0); X ap += BITBLK; X break; X case NCL: X c = *lp++; X if (isinset(ap,c)) X return(0); X ap += BITBLK; X break; X case BOL: X if (lp != bol) X return(0); X break; X case EOL: X if (*lp) X return(0); X break; X case BOT: X bopat[*ap++] = lp; X break; X case EOT: X eopat[*ap++] = lp; X break; X case BOW: X if (!(lp!=bol && iswordc(lp[-1])) && iswordc(*lp)) X break; X return(0); X case EOW: X if ((lp!=bol && iswordc(lp[-1])) && !iswordc(*lp)) X break; X return(0); X case REF: X n = *ap++; X bp = bopat[n]; X ep = eopat[n]; X while (bp < ep) X if (*bp++ != *lp++) X return(0); X break; X case CLO: X are = lp; X switch(*ap) { X X case ANY: X while (*lp) X lp++; X n = ANYSKIP; X break; X case CHR: X c = *(ap+1); X while (*lp && c == *lp) X lp++; X n = CHRSKIP; X break; X case CCL: X case NCL: X while (*lp && (e = pmatch(lp, ap))) X lp = e; X n = CCLSKIP; X break; X default: X re_fail("closure: bad dfa.", *ap); X return(0); X } X X ap += n; X X while (lp >= are) { X if (e = pmatch(lp, ap)) X return(e); X --lp; X } X return(0); X default: X re_fail("re_exec: bad dfa.", op); X return(0); X } X return(lp); X} X X/* X * re_modw: X * add new characters into the word table to X * change the re_exec's understanding of what X * a word should look like. Note that we only X * accept additions into the word definition. X * X * If the string parameter is 0 or null string, X * the table is reset back to the default, which X * contains A-Z a-z 0-9 _. [We use the compact X * bitset representation for the default table] X * X */ X Xstatic char deftab[16] = { X 0, 0, 0, 0, 0, 0, 377, 003, 376, 377, 377, 207, X 376, 377, 377, 007 X}; X Xvoid Xre_modw(s) Xregister char *s; X{ X register int i; X X if (!s || !*s) { X for (i = 0; i < MAXCHR; i++) X if (!isinset(deftab,i)) X iswordc(i) = 0; X } X else X while(*s) X iswordc(*s++) = 1; X} X X/* X * re_subs: X * substitute the matched portions of the src in X * dst. X * X * & substitute the entire matched pattern. X * X * \digit substitute a subpattern, with the given X * tag number. Tags are numbered from 1 to X * 9. If the particular tagged subpattern X * does not exist, null is substituted. X * X */ Xint Xre_subs(src, dst) Xregister char *src; Xregister char *dst; X{ X register char c; X register int pin; X register char *bp; X register char *ep; X X if (!*src || !bopat[0]) X return(0); X X while (c = *src++) { X switch(c) { X X case '&': X pin = 0; X break; X X case '\\': X c = *src++; X if (c >= '0' && c <= '9') { X pin = c - '0'; X break; X } X X default: X *dst++ = c; X continue; X } X X if ((bp = bopat[pin]) && (ep = eopat[pin])) { X while (*bp && bp < ep) X *dst++ = *bp++; X if (bp < ep) X return(0); X } X } X *dst = (char) 0; X return(1); X} X X#ifdef DEBUG X/* X * symbolic - produce a symbolic dump of the X * dfa X */ Xsymbolic(s) Xchar *s; X{ X printf("pattern: %s\n", s); X printf("dfacode:\n"); X dfadump(dfa); X} X Xstatic Xdfadump(ap) XCHAR *ap; X{ X register int n; X X while (*ap != END) X switch(*ap++) { X case CLO: X printf("CLOSURE"); X dfadump(ap); X switch(*ap) { X case CHR: X n = CHRSKIP; X break; X case ANY: X n = ANYSKIP; X break; X case CCL: X case NCL: X n = CCLSKIP; X break; X } X ap += n; X break; X case CHR: X printf("\tCHR %c\n",*ap++); X break; X case ANY: X printf("\tANY .\n"); X break; X case BOL: X printf("\tBOL -\n"); X break; X case EOL: X printf("\tEOL -\n"); X break; X case BOT: X printf("BOT: %d\n",*ap++); X break; X case EOT: X printf("EOT: %d\n",*ap++); X break; X case BOW: X printf("BOW\n"); X break; X case EOW: X printf("EOW\n"); X break; X case REF: X printf("REF: %d\n",*ap++); X break; X case CCL: X printf("\tCCL ["); X for (n = 0; n < MAXCHR; n++) X if (isinset(ap,(CHAR)n)) X printf("%c",n); X printf("]\n"); X ap += BITBLK; X break; X case NCL: X printf("\tNCL ["); X for (n = 0; n < MAXCHR; n++) X if (isinset(ap,(CHAR)n)) X printf("%c",n); X printf("]\n"); X ap += BITBLK; X break; X default: X printf("bad dfa. opcode %o\n", ap[-1]); X exit(1); X break; X } X} X#endif SHAR_EOF if test 18779 -ne "`wc -c 'regex.c'`" then echo shar: error transmitting "'regex.c'" '(should have been 18779 characters)' fi echo shar: extracting "'re_fail.c'" '(292 characters)' if test -f 're_fail.c' then echo shar: over-writing existing file "'re_fail.c'" fi sed 's/^X//' << \SHAR_EOF > 're_fail.c' X X#ifdef vms X#include stdio X#else X#include <stdio.h> X#endif X X/* X * re_fail: X * internal error handler for re_exec. X * X * should probably do something like a X * longjump to recover gracefully. X */ Xvoid Xre_fail(s, c) Xchar *s; Xchar c; X{ X fprintf(stderr, "%s [opcode %o]\n", s, c); X exit(1); X} SHAR_EOF if test 292 -ne "`wc -c 're_fail.c'`" then echo shar: error transmitting "'re_fail.c'" '(should have been 292 characters)' fi echo shar: extracting "'grep.c'" '(1014 characters)' if test -f 'grep.c' then echo shar: over-writing existing file "'grep.c'" fi sed 's/^X//' << \SHAR_EOF > 'grep.c' X#ifdef vms X#include stdio X#else X#include <stdio.h> X#endif X X/* X * Rudimentary grep to test regex routines. X * X * DEBUG is only applicable X * to oZ version of regex. Make sure to X * compile regex.c with -DDEBUG as well. X * X */ Xextern char *re_comp(); Xextern re_exec(); X Xmain(argc,argv) Xchar *argv[]; X{ X char str[512]; X FILE *f; X register int n; X register char *p; X X if (argc < 2) X error("usage: grep pat [file]"); X X if ((p = re_comp(argv[1])) != 0) { X printf("\t%s: %s\n", p, argv[1]); X exit(1); X } X#ifdef DEBUG X symbolic(argv[1]); X#endif X if (p = argv[2]) { X if ((f = fopen(p, "r")) == NULL) { X printf("cannot open %s\n", argv[2]); X exit(1); X } X while ((n = load(str, f)) != EOF) X if (re_exec(str)) X printf("%s\n",str); X X } X exit(0); X} Xload (s, f) Xchar *s; XFILE *f; X{ X register int c; X static int lineno = 0; X X while ((c = getc(f)) != '\n' && c != EOF) X *s++ = c; X if (c == EOF) X return (EOF); X *s = (char) 0; X return (++lineno); X} X Xerror(s) char *s ; X{ X fprintf(stderr,"%s\n",s); X exit(1); X} SHAR_EOF if test 1014 -ne "`wc -c 'grep.c'`" then echo shar: error transmitting "'grep.c'" '(should have been 1014 characters)' fi echo shar: extracting "'makefile'" '(2472 characters)' if test -f 'makefile' then echo shar: over-writing existing file "'makefile'" fi sed 's/^X//' << \SHAR_EOF > 'makefile' X# X# regex routines (PUBLIC DOMAIN) X# X# by: Ozan S. Yigit (oz) X# Dept. of Computer Science X# York University X# X# Applicable to BSD: X# X# If you have the generic(?) regex routines X# than you can compare the timings of these X# routines to the generic ones by: X# X# make times X# X# which will create two rudimentary greps X# lgrep and ogrep. lgrep will use the generic X# regex routines, and ogrep will use oz version X# of regex. Several patterns will be searched X# in /usr/dict/words, and the output of the greps X# will be compared. [for reasons of sanity] X# X# Surely, you will note, the time tests are somewhat X# biased, since /usr/dict/words contain *short* lines, X# thereby the real-life case of searching a complex X# expression within a long line is not covered. You X# will find, however, that the PD regex routines will X# search *as fast* as the generic ones in most X# cases, and about 10% slower in some cases, when X# tested with files containing *long* lines. X# XCFLAGS = -O X# X# test patterns X# XPAT1 = '[d-f]zz*.*m' XPAT2 = 'fo[ornt]*.*b[a-d]*' XPAT3 = '.th.' XPAT4 = '$ab$[a-d]*\1' XPAT5 = 'burp' X XFILE = /usr/dict/words XOUTD = /tmp/ X XRSRC = regex.o re_fail.o X Xregex: $(RSRC) X# X# insert code to put these into a library X# Xrlint: X lint -phc regex.c Xdebug: X cc -O -DDEBUG -o ogrep grep.c regex.c re_fail.c X Xlgrep: grep.o X cc -o lgrep grep.o X Xogrep: grep.o $(RSRC) X cc -o ogrep grep.o $(RSRC) X Xtimes: lgrep ogrep X @echo generic regex vs oz regex X# @echo pattern: $(PAT1) X time ogrep $(PAT1) $(FILE) >$(OUTD)ogrep.out X time lgrep $(PAT1) $(FILE) >$(OUTD)lgrep.out X @echo output differences: X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out X @echo "---" X# @echo pattern: $(PAT2) X time ogrep $(PAT2) $(FILE) >$(OUTD)ogrep.out X time lgrep $(PAT2) $(FILE) >$(OUTD)lgrep.out X @echo output differences: X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out X @echo "---" X# echo pattern: $(PAT3) X time ogrep $(PAT3) $(FILE) >$(OUTD)ogrep.out X time lgrep $(PAT3) $(FILE) >$(OUTD)lgrep.out X @echo output differences: X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out X @echo "---" X# echo pattern: $(PAT4) X time ogrep $(PAT4) $(FILE) >$(OUTD)ogrep.out X time lgrep $(PAT4) $(FILE) >$(OUTD)lgrep.out X @echo output differences: X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out X @echo "---" X# echo pattern: $(PAT5) X time ogrep $(PAT5) $(FILE) >$(OUTD)ogrep.out X time lgrep $(PAT5) $(FILE) >$(OUTD)lgrep.out X @echo output differences: X -diff $(OUTD)ogrep.out $(OUTD)lgrep.out X @echo "---" X @rm $(OUTD)ogrep.out $(OUTD)lgrep.out SHAR_EOF if test 2472 -ne "`wc -c 'makefile'`" then echo shar: error transmitting "'makefile'" '(should have been 2472 characters)' fi # End of shell archive exit 0